1. INTRODUCTION


In the practical applications of pattern recognition, such as in remote sensing, 
there is considerable interest in the use of linear classifiers because they 
are simple and because fewer parameters need to be estimated. In many 
cases, it is required to estimate the probability of error in addition to 
designing the classifier. (For example, in remote sensing, a separate set
of labeled patterns is used in estimating the probability of error.) For
designing the classifiers, the labels of the training patterns need to be 
obtained, and often acquiring labels is expensive. Hence, available training 
samples should be effectively used for designing the classifier and esti- 
mating the probability of error. 

The leave-one-out method (ref. 1) is proposed in the literature as an effec- 
tive way of estimating the probability of error from the training samples. 

The method is as follows. If there is a total of N labeled patterns, leave
out one pattern, design the classifier on the remaining (N - 1) patterns, and
test on the pattern that is left out. Repeat this process N times, each
time leaving out a different pattern, and then estimate the probability of
error as the average of these errors. Use of this method, however, requires N
classifiers to be designed. Fukunaga and Kessell (ref. 2) present a computa- 
tional method for estimating the probability of error of a Bayes classifier 
using the leave-one-out method. Chittineni (ref. 3) developed a computa- 
tional technique based on eigen perturbation theory for estimating the proba- 
bility of error of the Fisher classifier using the leave-groups-out method. 
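
The leave-one-out procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the computational method of this paper: the function names `leave_one_out_error` and `fit_nearest_mean`, the one-dimensional nearest-mean classifier, and the data are all hypothetical and serve only to make the loop runnable.

```python
from statistics import mean

def fit_nearest_mean(samples):
    """Toy stand-in classifier (hypothetical): label by the nearest class mean."""
    means = {}
    for label in {lab for _, lab in samples}:
        means[label] = mean(x for x, lab in samples if lab == label)
    return lambda x: min(means, key=lambda lab: abs(x - means[lab]))

def leave_one_out_error(samples, fit):
    """Design N classifiers, each on N - 1 patterns, and test on the one left out."""
    errors = 0
    for k, (x_k, label_k) in enumerate(samples):
        training = samples[:k] + samples[k + 1:]   # leave pattern k out
        classify = fit(training)                   # design on the remaining N - 1
        errors += classify(x_k) != label_k         # test on the left-out pattern
    return errors / len(samples)

# Two overlapping one-dimensional classes (made-up data).
data = [(0.0, "a"), (1.0, "a"), (2.0, "a"), (3.5, "a"),
        (3.0, "b"), (4.0, "b"), (5.0, "b"), (6.0, "b")]
loo_error = leave_one_out_error(data, fit_nearest_mean)
```

The loop calls `fit` N times, which is the cost that the computational expressions developed in this paper are meant to avoid for the Fisher classifier.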

This paper considers the Fisher classifier (refs. 4 and 5). The Fisher 
classifier is one of the most widely used linear classifiers. Computational 
expressions are developed based on matrix theory for estimating the proba- 
bility of error of the Fisher classifier using the leave-one-out method. 

This paper is organized as follows. 

Section 2 briefly presents the Fisher classifier. Section 3 develops
computational expressions for using the leave-one-out method for estimating
the error probability of the Fisher classifier. Section 4 discusses the effect
of the Fisher threshold and presents expressions for obtaining the optimal
threshold by minimizing the probability of error. Section 5 presents a simple
generalization of the Fisher classifier to multiple classes. Section 6
develops computationally efficient expressions for the estimation of
multicategory Fisher error using the leave-one-out method. Some matrix
relations used in the paper are derived in the appendix (ref. 6).

2. FISHER CLASSIFIER 

The Fisher classifier is a linear classifier that uses a direction $W$ for the
discriminant function

$$ g(X) = W^T X - t \tag{1} $$

so that when the training patterns are projected onto this direction, the
intraclass patterns are clustered and the interclass patterns are separated
to the extent possible, as depicted in figure 1.

Let $X_k^i \in \omega_i$, $k = 1, 2, \ldots, N_i$, $i = 1, 2$ be the training
pattern set. The unbiased estimates of the means $\hat{m}_i$ and covariance
matrices $\hat{\Sigma}_i$ of the patterns in the classes $\omega_i$ are given
by the following:

$$ \hat{m}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} X_j^i \tag{2a} $$

$$ \hat{\Sigma}_i = \frac{1}{N_i - 1} \sum_{j=1}^{N_i}
   (X_j^i - \hat{m}_i)(X_j^i - \hat{m}_i)^T \tag{2b} $$

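The Fisher classifier chooses the weight vector $W$ such that the criterion
$\xi$ is maximized, where

$$ \xi = \frac{\left[ W^T (\hat{m}_1 - \hat{m}_2) \right]^2}{W^T S_w W} \tag{3} $$

where $S_w = \hat{\Sigma}_1 + \hat{\Sigma}_2$. The weight vector $W$, which
maximizes $\xi$, can be shown to be

$$ W = S_w^{-1} (\hat{m}_1 - \hat{m}_2) \tag{4} $$

The Fisher threshold $t$ is chosen as

$$ t = \frac{1}{2} W^T (\hat{m}_1 + \hat{m}_2) \tag{5} $$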

The direction $W$ and the threshold $t$ are illustrated in figure 1. Fisher's
decision rule is as follows:

$$ \text{Decide } X \in \omega_1 \text{ if } g(X) > 0; \qquad
   \text{decide } X \in \omega_2 \text{ if } g(X) < 0 \tag{6} $$
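
The design of the two-class Fisher classifier can be illustrated with a short NumPy sketch. It assumes that patterns are stored as rows, that the within-class matrix is the sum of the two unbiased covariance estimates, and that the threshold is the projected midpoint of the two class means; the function names `fisher_fit` and `fisher_decide` and the data are hypothetical.

```python
import numpy as np

def fisher_fit(X1, X2):
    """Return (W, t) following equations (2a) to (5); rows of X1, X2 are patterns."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)      # class means, equation (2a)
    S1 = np.cov(X1, rowvar=False)                  # unbiased covariance, equation (2b)
    S2 = np.cov(X2, rowvar=False)
    Sw = S1 + S2                                   # assumed within-class matrix
    W = np.linalg.solve(Sw, m1 - m2)               # weight vector, equation (4)
    t = 0.5 * W @ (m1 + m2)                        # threshold, equation (5)
    return W, t

def fisher_decide(W, t, X):
    """Decision rule (6): class 1 where g(X) = W^T X - t > 0, otherwise class 2."""
    return np.where(X @ W - t > 0, 1, 2)

# Made-up two-dimensional training patterns.
X1 = np.array([[2.0, 2.0], [3.0, 2.5], [2.5, 3.5], [3.5, 3.0]])
X2 = np.array([[0.0, 0.5], [1.0, 0.0], [0.5, 1.5], [1.5, 1.0]])
W, t = fisher_fit(X1, X2)
labels = fisher_decide(W, t, np.vstack([X1, X2]))
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly when only one classifier is designed; the recursive relations of the next section, by contrast, reuse the inverse many times.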



3. RECURSIVE RELATIONS FOR THE FISHER WEIGHT VECTOR AND THRESHOLD 

In this section, computational expressions are developed for using the
leave-one-out method with the Fisher classifier. The justification for the
leave-one-out method for estimating the probability of error is as follows.

In general, the probability of error, $\varepsilon$, is a function of two
arguments:

$$ \varepsilon(\Theta_1, \Theta_2) \tag{7} $$

where $\Theta_1$ is the set of parameters for the distributions used to design
the classifier and $\Theta_2$ is the set of parameters for the distributions
used to test the performance. Let $\Theta$ and $\hat{\Theta}$ be the set of
true parameters and their estimates. The estimate $\hat{\Theta}$ is a random
vector that depends on the particular sample used in its estimation. Let
$\hat{\Theta}_N$ be a particular value of $\hat{\Theta}$. Then (from ref. 7),

$$ \varepsilon(\Theta, \Theta) \le \varepsilon(\hat{\Theta}_N, \Theta) \tag{8} $$

Taking expectations on both sides, one gets

$$ \varepsilon(\Theta, \Theta) \le E\left[ \varepsilon(\hat{\Theta}_N, \Theta) \right] \tag{9} $$

One of the ways of estimating the quantity on the right-hand side of
equation (9) is with the leave-one-out method described in section 1.
Presented in the following paragraphs are computational expressions for
implementing the leave-one-out method with the Fisher classifier described in
section 2. The cases in which a pattern $X_k^1$ from class $\omega_1$ is left
out and in which a pattern from class $\omega_2$ is left out can be treated
similarly.

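Let a pattern $X_k^1$ from class $\omega_1$ be left out and the patterns from
class $\omega_2$ remain. The means $\hat{m}_i$, $i = 1, 2$ and the covariance
matrix $\hat{\Sigma}_2$ are defined as in equations (2a) and (2b). Define the
covariance matrix of the total pattern set from class $\omega_1$ as

$$ \Sigma_1 = \frac{1}{N_1 - 2} \sum_{j=1}^{N_1}
   (X_j^1 - \hat{m}_1)(X_j^1 - \hat{m}_1)^T \tag{10} $$

and let

$$ S_w = \Sigma_1 + \hat{\Sigma}_2 \tag{11} $$

Note that $\Sigma_1$ is defined differently from the usual unbiased estimate
for covariance matrices for mathematical simplicity; this definition will not
affect the results. Now compute $W$ and $t$ as

$$ W = S_w^{-1} (\hat{m}_1 - \hat{m}_2) \tag{12} $$

and

$$ t = \frac{1}{2} W^T (\hat{m}_1 + \hat{m}_2) \tag{13} $$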

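When a pattern $X_k^1$ from class $\omega_1$ is left out, the unbiased
estimates of the mean $\hat{m}_{1k}$ and the covariance matrix
$\hat{\Sigma}_{1k}$ of the patterns in class $\omega_1$ are given by the
following:

$$ \hat{m}_{1k} = \frac{1}{N_1 - 1} \sum_{\substack{j=1 \\ j \ne k}}^{N_1} X_j^1 \tag{14} $$

and

$$ \hat{\Sigma}_{1k} = \frac{1}{N_1 - 2} \sum_{\substack{j=1 \\ j \ne k}}^{N_1}
   (X_j^1 - \hat{m}_{1k})(X_j^1 - \hat{m}_{1k})^T \tag{15} $$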

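Let $S_{w1k} = \hat{\Sigma}_{1k} + \hat{\Sigma}_2$. Then the Fisher weight
vector $W_{1k}$ and threshold $t_{1k}$, when a pattern from class $\omega_1$
is left out, are given by

$$ W_{1k} = S_{w1k}^{-1} (\hat{m}_{1k} - \hat{m}_2) \tag{16} $$

$$ t_{1k} = \frac{1}{2} W_{1k}^T (\hat{m}_{1k} + \hat{m}_2) \tag{17} $$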

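Expressions are now developed for the computation of $W_{1k}$ and $t_{1k}$ in
terms of $W$ and $t$. The relationships between $\hat{m}_{1k}$,
$\hat{\Sigma}_{1k}$ and $\hat{m}_1$, $\Sigma_1$, and $S_w$ can be shown to be
as follows (see the appendix):

$$ \hat{m}_{1k} = \hat{m}_1 - \frac{1}{N_1 - 1} (X_k^1 - \hat{m}_1) \tag{18} $$

$$ \hat{\Sigma}_{1k} = \Sigma_1 - \frac{N_1}{(N_1 - 1)(N_1 - 2)}
   (X_k^1 - \hat{m}_1)(X_k^1 - \hat{m}_1)^T \tag{19} $$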

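From equation (18), one obtains

$$ (\hat{m}_{1k} - \hat{m}_2) = (\hat{m}_1 - \hat{m}_2)
   - \frac{1}{N_1 - 1} (X_k^1 - \hat{m}_1) \tag{20} $$

and, from equation (19),

$$ S_{w1k} = S_w - \alpha (X_k^1 - \hat{m}_1)(X_k^1 - \hat{m}_1)^T \tag{21} $$

From equation (21), one obtains (appendix)

$$ S_{w1k}^{-1} = S_w^{-1} + \frac{\alpha \, S_w^{-1} (X_k^1 - \hat{m}_1)
   (X_k^1 - \hat{m}_1)^T S_w^{-1}}
   {1 - \alpha (X_k^1 - \hat{m}_1)^T S_w^{-1} (X_k^1 - \hat{m}_1)} \tag{22} $$

where

$$ \alpha = \frac{N_1}{(N_1 - 1)(N_1 - 2)} \tag{23} $$

$$ Y(X_k^1) = S_w^{-1} (X_k^1 - \hat{m}_1) \tag{24} $$

$$ \beta(X_k^1) = (X_k^1 - \hat{m}_1)^T S_w^{-1} (X_k^1 - \hat{m}_1) \tag{25} $$

$$ \nu(X_k^1) = 1 - \alpha \, \beta(X_k^1) \tag{26} $$

$$ \ldots \; \frac{\hat{m}_1 + \hat{m}_2}{2} \tag{27} $$

$$ Z(X_k^1) = Y^T(X_k^1) (\hat{m}_1 - \hat{m}_2) \tag{28} $$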

Using the definitions of equations (23) to (28), one obtains the following.

Equations (29) and (30) can be used to compute $W_{1k}$ and $t_{1k}$ from $W$
and $t$ every time that a pattern is left out from class $\omega_1$ and the
pattern is tested. Similarly, recursive expressions can be derived when a
pattern $X_k^2$ is left out from class $\omega_2$. It is to be noted that
because the covariance matrices are defined as in equation (10), the matrix
$S_w$ is to be computed and inverted twice, once when patterns from class
$\omega_1$ are left out and again when patterns from class $\omega_2$ are
left out.
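
The saving offered by the recursive relations can be checked numerically. The following NumPy sketch does not quote equations (29) and (30); instead it forms $W_{1k}$ directly from the mean update of equation (20) and the rank-one inverse update of equation (22), assuming the class-1 covariance matrix uses the divisor $N_1 - 2$, and compares the result with redesigning the classifier from scratch. The function names and the random data are hypothetical.

```python
import numpy as np

def fisher_loo_class1(X1, X2):
    """Leave each class-1 pattern out in turn; return a list of (W_1k, t_1k).

    S_w is inverted once; each left-out pattern then needs only
    matrix-vector products.
    """
    N1 = len(X1)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    D1 = X1 - m1
    Sigma1 = D1.T @ D1 / (N1 - 2)                  # assumed divisor N1 - 2
    Sw_inv = np.linalg.inv(Sigma1 + np.cov(X2, rowvar=False))
    alpha = N1 / ((N1 - 1) * (N1 - 2))
    results = []
    for d in D1:                                   # d = X_k - m1
        Y = Sw_inv @ d
        nu = 1.0 - alpha * (d @ Y)                 # 1 - alpha * beta
        diff = (m1 - m2) - d / (N1 - 1)            # m_1k - m2, equation (20)
        # Apply the updated inverse of equation (22) to diff.
        W_1k = Sw_inv @ diff + (alpha / nu) * Y * (Y @ diff)
        m_1k = m1 - d / (N1 - 1)                   # equation (18)
        t_1k = 0.5 * W_1k @ (m_1k + m2)            # midpoint threshold
        results.append((W_1k, t_1k))
    return results

def fisher_direct(X1, X2):
    """Brute-force reference: design the classifier from scratch."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
    W = np.linalg.solve(Sw, m1 - m2)
    return W, 0.5 * W @ (m1 + m2)

rng = np.random.default_rng(0)
X1 = rng.normal(loc=1.0, size=(12, 3))
X2 = rng.normal(loc=-1.0, size=(15, 3))
fast = fisher_loo_class1(X1, X2)
slow = [fisher_direct(np.delete(X1, k, axis=0), X2) for k in range(len(X1))]
max_gap = max(np.abs(Wf - Ws).max() for (Wf, _), (Ws, _) in zip(fast, slow))
max_t_gap = max(abs(tf - ts) for (_, tf), (_, ts) in zip(fast, slow))
```

In this sketch the two routes agree to rounding error, while the recursive route performs a single matrix inversion for all $N_1$ left-out patterns.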

4. SELECTION OF AN OPTIMAL THRESHOLD 

This section considers the problem of finding the optimum threshold, $t$, to
achieve minimum probability of error for the patterns projected onto
Fisher's direction. The patterns in class $\omega_i$ are assumed to be
normally distributed; i.e., $p(X|\omega_i) \sim N(m_i, \Sigma_i)$. Let $y$ be
the projection of pattern $X$ onto Fisher's direction $W$; i.e.,

$$ y = W^T X \tag{31} $$

Since $X$ is normally distributed, $y$ is also normally distributed; i.e.,

$$ p(y|\omega_i) \sim N(\mu_i, \sigma_i^2), \quad i = 1, 2 \tag{32} $$

where

$$ \mu_i = W^T m_i \tag{33} $$

and

$$ \sigma_i^2 = W^T \Sigma_i W \tag{34} $$


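If Fisher's decision rule is used (decide $y \in \omega_1$ if $y > t$;
otherwise decide $y \in \omega_2$), the probability of error incurred can be
written as

$$ P_e = P_1 \int_{-\infty}^{t} p(y|\omega_1) \, dy
       + P_2 \int_{t}^{\infty} p(y|\omega_2) \, dy
       = P_1 \int_{-\infty}^{(t - \mu_1)/\sigma_1} \phi(\zeta) \, d\zeta
       + P_2 \int_{(t - \mu_2)/\sigma_2}^{\infty} \phi(\zeta) \, d\zeta \tag{35} $$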


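where $\phi(\zeta) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} \zeta^2 \right)$
and $P_i$ are the a priori probabilities of the classes $\omega_i$,
$i = 1, 2$. On differentiating equation (35) with respect to $t$, the
following is obtained:

$$ \frac{\partial P_e}{\partial t}
   = \frac{P_1}{\sigma_1} \phi\left( \frac{t - \mu_1}{\sigma_1} \right)
   - \frac{P_2}{\sigma_2} \phi\left( \frac{t - \mu_2}{\sigma_2} \right) \tag{36} $$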

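Equating to zero and then simplifying it, one obtains

$$ \left( \frac{t - \mu_2}{\sigma_2} \right)^2
   - \left( \frac{t - \mu_1}{\sigma_1} \right)^2
   = 2 \ln\left( \frac{P_2 \sigma_1}{P_1 \sigma_2} \right) \tag{37} $$

The following cases are considered: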

Case (1): $P_1 = P_2$, $\sigma_1 = \sigma_2$

From equation (37), the optimum value of $t$ that minimizes the probability
of error for Fisher's direction is obtained as

$$ t = \frac{\mu_1 + \mu_2}{2} \tag{38} $$


Equations (13) and (38) show that this is the threshold that is often imple- 
mented with the Fisher classifier. 

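Case (2): $P_1 \ne P_2$, $\sigma_1 = \sigma_2 = \sigma$

In this case, the optimum value of the threshold $t$ can be obtained from
equation (37) as

$$ t = \frac{\mu_1 + \mu_2}{2}
     + \frac{\sigma^2}{\mu_1 - \mu_2} \ln\left( \frac{P_2}{P_1} \right) \tag{39} $$

Case (3): $P_1 \ne P_2$, $\sigma_1 \ne \sigma_2$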


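On simplification, the following is obtained from equation (37):

$$ (\sigma_1^2 - \sigma_2^2) t^2 + 2 (\mu_1 \sigma_2^2 - \mu_2 \sigma_1^2) t
   + \mu_2^2 \sigma_1^2 - \mu_1^2 \sigma_2^2
   - 2 \sigma_1^2 \sigma_2^2 \ln\left( \frac{P_2 \sigma_1}{P_1 \sigma_2} \right) = 0 \tag{40} $$

This is a quadratic equation of the form $a t^2 + b t + c = 0$. The
discriminant of the equation, $\eta = b^2 - 4ac$, can be shown to be

$$ \eta = 4 \sigma_1^2 \sigma_2^2 \left[ (\mu_1 - \mu_2)^2
   + 2 (\sigma_1^2 - \sigma_2^2) \ln\left( \frac{P_2 \sigma_1}{P_1 \sigma_2} \right) \right] \tag{41} $$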

From equation (41), it is seen that when $P_1 = P_2$, $\eta$ is always
positive, thus giving real roots for equation (40). Even when $P_1 \ne P_2$,
if $\eta$ is positive, real roots are obtained for $t$. When $\eta$ is
negative, there exists no real threshold that minimizes the probability of
error. Equation (40) gives two roots for $t$. Since $P_e$ is continuous in
$t$, the $t$ that minimizes $P_e$ can be obtained by looking at the second
derivative of $P_e$. Differentiating equation (36) with respect to $t$, one
obtains

$$ \frac{\partial^2 P_e}{\partial t^2}
   = -\frac{P_1}{\sigma_1^2} \left( \frac{t - \mu_1}{\sigma_1} \right)
     \phi\left( \frac{t - \mu_1}{\sigma_1} \right)
   + \frac{P_2}{\sigma_2^2} \left( \frac{t - \mu_2}{\sigma_2} \right)
     \phi\left( \frac{t - \mu_2}{\sigma_2} \right) \tag{42} $$
The root of equation (40) that gives a positive value for equation (42) is 
taken as the value of t, which minimizes the probability of error. Using the 
results of the last section, one can update the threshold t for use with the 
leave-one-out method since it is a function of means and covariance matrices. 
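
The three cases above can be collected into one small routine. The sketch below uses only the Python standard library and assumes the decision rule "decide class 1 if y > t" with $\mu_1 \ne \mu_2$; it solves the stationarity condition of equation (37) and keeps the root at which the second derivative of $P_e$ is positive. The function name `optimal_threshold` and the numerical values are hypothetical.

```python
import math

def optimal_threshold(mu1, sigma1, mu2, sigma2, p1, p2):
    """Threshold minimizing P_e for the rule 'decide class 1 if y > t'.

    Assumes mu1 != mu2. Returns None when no stationary point is a minimum.
    """
    if math.isclose(sigma1, sigma2):
        # Equal variances: the condition is linear in t (cases 1 and 2).
        return 0.5 * (mu1 + mu2) + sigma1**2 * math.log(p2 / p1) / (mu1 - mu2)
    # Unequal variances: quadratic a t^2 + b t + c = 0 (case 3).
    log_term = math.log(p2 * sigma1 / (p1 * sigma2))
    a = sigma1**2 - sigma2**2
    b = 2.0 * (mu1 * sigma2**2 - mu2 * sigma1**2)
    c = (mu2 * sigma1)**2 - (mu1 * sigma2)**2 - 2.0 * (sigma1 * sigma2)**2 * log_term
    eta = b * b - 4.0 * a * c                      # discriminant
    if eta < 0:
        return None
    roots = [(-b + s * math.sqrt(eta)) / (2.0 * a) for s in (1.0, -1.0)]

    def phi(z):
        return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

    def second_derivative(t):
        z1, z2 = (t - mu1) / sigma1, (t - mu2) / sigma2
        return -p1 / sigma1**2 * z1 * phi(z1) + p2 / sigma2**2 * z2 * phi(z2)

    minima = [t for t in roots if second_derivative(t) > 0]
    return minima[0] if minima else None

t_case1 = optimal_threshold(2.0, 1.0, 0.0, 1.0, 0.5, 0.5)    # midpoint of the means
t_case2 = optimal_threshold(2.0, 1.0, 0.0, 1.0, 0.25, 0.75)  # shifted toward class 1
t_case3 = optimal_threshold(2.0, 1.0, 0.0, 2.0, 0.5, 0.5)    # root of the quadratic
```

With the means and variances replaced by their leave-one-out values, the same routine could be applied each time a pattern is left out.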


5. GENERALIZATION OF THE FISHER CLASSIFIER TO MULTIPLE CLASSES 


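Rewriting equations (12) and (13) in terms of the discriminant functions
$g_i(X) = V_i^T X + v_i$, $i = 1, 2$, the following decision rule is
implemented:

$$ \text{Decide } X \in \omega_1 \text{ if } g_1(X) > g_2(X) \tag{43} $$

$$ \text{Decide } X \in \omega_2 \text{ if } g_1(X) < g_2(X) \tag{44} $$

Thus

$$ V_i = S_w^{-1} \hat{m}_i \tag{45} $$

and

$$ v_i = -\frac{1}{2} \hat{m}_i^T S_w^{-1} \hat{m}_i \tag{46} $$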

It is seen that equations (43) to (46) implement the decision rule of
equation (6). This suggests the definition of discriminant functions for
an M-class problem as

$$ g_i(X) = V_i^T X + v_i, \quad i = 1, 2, \ldots, M $$

where



Then the decision rule is the following: Decide $X \in \omega_i$ if

$$ g_i(X) > g_j(X), \quad j = 1, 2, \ldots, M, \quad j \ne i \tag{48} $$
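
The M-class rule can be illustrated with a short NumPy sketch. It takes $V_i = S_w^{-1} \hat{m}_i$ and $v_i = -\frac{1}{2} \hat{m}_i^T S_w^{-1} \hat{m}_i$, the forms for which $g_1(X) - g_2(X)$ reduces to $W^T X - t$ in the two-class case. The choice of $S_w$ as the plain sum of the class covariance estimates is an assumption made here for illustration, and the function names and data are hypothetical.

```python
import numpy as np

def multiclass_fisher_fit(class_patterns):
    """class_patterns: list of (N_i, n) arrays, one array per class."""
    means = [X.mean(axis=0) for X in class_patterns]
    # Assumption: pooled matrix is the sum of the unbiased class covariances.
    Sw = sum(np.cov(X, rowvar=False) for X in class_patterns)
    V = np.array([np.linalg.solve(Sw, m) for m in means])      # V_i = Sw^{-1} m_i
    v = np.array([-0.5 * m @ Vi for m, Vi in zip(means, V)])   # v_i = -m_i^T V_i / 2
    return V, v

def multiclass_fisher_decide(V, v, X):
    """Return the 1-based index of the class with the largest g_i(X)."""
    return np.argmax(X @ V.T + v, axis=1) + 1

# Three made-up, well-separated classes in two dimensions.
base = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
classes = [base, base + [5.0, 0.0], base + [0.0, 5.0]]
V, v = multiclass_fisher_fit(classes)
labels = multiclass_fisher_decide(V, v, np.vstack(classes))
```

Taking the largest discriminant with `argmax` is equivalent to checking $g_i(X) > g_j(X)$ for every $j \ne i$, apart from ties, which this sketch resolves in favor of the lowest class index.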


6. COMPUTATIONAL EXPRESSIONS FOR THE LEAVE-ONE-OUT METHOD IN A MULTICLASS CASE


This section presents computational expressions for the leave-one-out method
for updating $V_i$ and $v_i$. Let there be $M$ classes. Consider the case when
a pattern $X_k^1$ from class $\omega_1$ is left out. Define the means and
covariance matrices of the total pattern set as



The $\hat{m}_i$ and $\hat{\Sigma}_i$, $i = 2, \ldots, M$ are defined as in
equation (49). Proceeding as in section 3, one obtains recursive relations for
Fisher's parameters as follows:



where



Recursive relations can be obtained similarly when a pattern $X_k^i$ from
class $\omega_i$ is left out. It is to be noted that the matrix $S_w$ is to be
inverted once for each class. The use of these recursive relations results in
a computationally efficient way of implementing the leave-one-out method.

7. CONCLUSIONS 

The Fisher classifier is one of the simplest and most widely used linear 
classifiers. Recently, considerable interest in its application for the 
classification of multispectral data acquired by Landsat has been expressed. 
Acquiring labels of the training patterns is expensive, and in many cases
the probability of error is to be estimated in addition to designing the
classifier. (For example, in remote sensing, a separate set of labeled
patterns is used for estimating the probability of error.) Hence, in
practical applications, it is advantageous to use the available labeled
patterns more effectively.

This paper has presented computational expressions for estimating the
probability of error of the Fisher classifier using the leave-one-out method.
Thus, the available labeled patterns can be used effectively, both for
designing the classifier and estimating the probability of error. Since the
classification accuracy depends on the threshold used with the Fisher
classifier, expressions for the optimal threshold for minimizing the
probability of error in Fisher's direction are presented.

APPENDIX A

DERIVATION OF MATRIX RELATIONS 


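From equation (14), one obtains

$$ \hat{m}_{1k} = \frac{1}{N_1 - 1} \left[ \sum_{j=1}^{N_1} X_j^1 - X_k^1 \right]
   = \frac{1}{N_1 - 1} \left( N_1 \hat{m}_1 - X_k^1 \right)
   = \hat{m}_1 - \frac{1}{N_1 - 1} (X_k^1 - \hat{m}_1) \tag{A-1} $$

thus obtaining equation (18). From equation (15),

$$ (N_1 - 2) \hat{\Sigma}_{1k}
   = \sum_{\substack{j=1 \\ j \ne k}}^{N_1}
     (X_j^1 - \hat{m}_{1k})(X_j^1 - \hat{m}_{1k})^T
   = \sum_{j=1}^{N_1} (X_j^1 - \hat{m}_{1k})(X_j^1 - \hat{m}_{1k})^T
   - (X_k^1 - \hat{m}_{1k})(X_k^1 - \hat{m}_{1k})^T \tag{A-2} $$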
(A— 2) 


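Consider the following, both of which follow from equation (A-1):

$$ \sum_{j=1}^{N_1} (X_j^1 - \hat{m}_{1k})(X_j^1 - \hat{m}_{1k})^T
   = (N_1 - 2) \Sigma_1
   + \frac{N_1}{(N_1 - 1)^2} (X_k^1 - \hat{m}_1)(X_k^1 - \hat{m}_1)^T \tag{A-3} $$

$$ X_k^1 - \hat{m}_{1k} = \frac{N_1}{N_1 - 1} (X_k^1 - \hat{m}_1) \tag{A-4} $$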

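Substituting equations (A-3) and (A-4) into (A-2) results in the following:

$$ (N_1 - 2) \hat{\Sigma}_{1k} = (N_1 - 2) \Sigma_1
   + \left[ \frac{N_1}{(N_1 - 1)^2} - \frac{N_1^2}{(N_1 - 1)^2} \right]
     (X_k^1 - \hat{m}_1)(X_k^1 - \hat{m}_1)^T $$

so that

$$ \hat{\Sigma}_{1k} = \Sigma_1 - \frac{N_1}{(N_1 - 1)(N_1 - 2)}
   (X_k^1 - \hat{m}_1)(X_k^1 - \hat{m}_1)^T \tag{A-5} $$

thus obtaining (19).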
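
The two update relations can be checked on a scalar instance, where the covariance matrix reduces to a variance. The sketch below uses only the Python standard library and assumes the divisor $N_1 - 2$ for the full-set variance, as the algebra above requires; the function name `downdate` and the data are hypothetical.

```python
import statistics

def downdate(xs, k):
    """Mean and unbiased variance of xs with xs[k] removed, without re-summing."""
    n = len(xs)
    m = sum(xs) / n
    sigma = sum((x - m) ** 2 for x in xs) / (n - 2)        # full set, divisor n - 2
    d = xs[k] - m
    m_k = m - d / (n - 1)                                  # scalar form of (18)
    sigma_k = sigma - n / ((n - 1) * (n - 2)) * d * d      # scalar form of (19)
    return m_k, sigma_k

xs = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
m_k, sigma_k = downdate(xs, 3)

# Direct recomputation on the remaining five values for comparison.
rest = xs[:3] + xs[4:]
direct = (statistics.mean(rest), statistics.variance(rest))
```

Here `statistics.variance` uses the divisor (number of values - 1), which for the five remaining values is $N_1 - 2$, so the two results are directly comparable.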


Let $S = \Sigma - \alpha M M^T$, where $S$ and $\Sigma$ are nonsingular
matrices and $M$ is a vector. Then the inverse of $S$ can be expressed in
terms of the inverse of $\Sigma$ as in reference 6:

$$ S^{-1} = \Sigma^{-1}
   + \frac{\alpha \, \Sigma^{-1} M M^T \Sigma^{-1}}{1 - \alpha M^T \Sigma^{-1} M} \tag{A-6} $$

thus obtaining equation (22).

